FINAL REPORT: "Spatial hearing, attention and informational masking in speech identification"


I.  Collaboration 

During  the  award  period  covered  by  this  report  (12-1-2008  through  11-30-2011)  the 
research  groups  at  Boston  University  and  at  Wright-Patterson  Air  Force  Base  have  collaborated 
on  portions  of  the  work  described  in  the  following  progress  report.  These  collaborative  efforts 
have  taken  several  forms  including  consultation  regarding  experimental  design,  jointly 
conducted  experiments,  sharing  of  results,  discussions  of  the  interpretation  and  theoretical 
implications  of  research  findings,  and  the  planning  of  new  studies.  A  brief  overview  of  our 
collaborative  projects  is  given  in  the  following  paragraph  with  specific  examples  provided 
throughout  the  report. 

The BU and WPAFB groups have routinely held joint discussions of research at the Spring
meeting of the Acoustical Society of America, the Midwinter Research Meeting of the
Association for Research in Otolaryngology, and the annual Binaural Bash conference held at
Boston University. In addition, visits specifically intended to foster collaborative research
projects have taken place outside of these regular scientific meetings. For example, Dr.
Virginia Best was sponsored by a Window on Science Program through the Asian Office of
Aerospace Research and Development to visit the group at WPAFB during August 18-29, 2008,
for discussions of ongoing collaborative research. An earlier visit
and  group  meetings  at  the  larger  scientific  society  conferences  led  to  a  collaboration  described 
in  a  scientific  paper  presented  at  the  2009  meeting  of  ARO.  Work  on  that  project,  and  related 
studies,  continues. 

II.  Progress  towards  Specific  Aims 
II.A.  Tuning  in  the  Spatial  Dimension 

Largely  as  a  consequence  of  the  work  supported  by  this  grant  during  the  previous  award 
period, our understanding of "spatial tuning" in azimuth has advanced considerably. The
following is a summary of the progress made toward this aim, referring where appropriate to both
published and unpublished work and discussing areas that remain unclear or that require
further study.

Part  of  the  impetus  for  formulating  this  aim  was  the  Ph.D.  dissertation  of  Nicole  Marrone 
(2007) indicating conditions under which highly selective spatial tuning could be observed. A
portion of that work and some extensions were subsequently published by Marrone et al.
(2008a,b,c). The gist of their findings was that highly "tuned" behavioral responses to
sound sources located in different spatial configurations could be observed under conditions in
which a high degree of informational masking was present. The key finding was that spatial
tuning was related to overcoming informational, rather than energetic, masking. This conclusion
is key because it implicated selective attention, rather than traditional binaural analysis
mechanisms, as the basis for spatial tuning (see also the related work of Best et al., 2005;
Allen et al., 2008). The term "binaural analysis" is often used as a catch-all for any
binaural  advantage  that  is  not  a  consequence  of  simple  acoustics  (i.e.,  differential  attenuation  of 
sounds  at  the  two  ears  due  to  "head  shadow").  Binaural  analysis  is  most  often  invoked  as  an 
explanation  for  the  "masking  level  difference"  (MLD)  which  provides  a  robust  advantage  for  the 
detection  of  a  low-frequency  tone  in  Gaussian  noise  due  to  interaural  differences  between  the 
tone  and  noise  (for  topical  reviews  see  Durlach  and  Colburn,  1978;  Colburn,  1995;  Stern  and 
Trahiotis,  1996).  The  parallel  reduction  in  target-to-masker  ratio  at  threshold  for  speech 
reception  due  to  similar  manipulations  in  interaural  signal  and  noise  parameters  may  also  be 
related to the same within-channel mechanisms (cf. Levitt and Rabiner, 1967; Zurek, 1993;
Culling  and  Colburn,  2000;  Culling  et  al.,  2006).  Zurek  (1993),  for  example,  calculates  that 
binaural  analysis  contributes  a  maximum  of  about  3-5  dB  to  the  overall  spatial  release  from 
masking  (SRM:  the  difference  between  target-to-masker  ratio  at  threshold,  T/M,  in  colocated 
and separated source conditions) for speech in noise in an anechoic sound field. In realistic
sound fields in which some reverberation is present, however, the release is less. Marrone et al.
(2008a)  have  also  found  a  small  SRM  (about  1.5  dB  for  symmetrically-separated  maskers)  in  a 
sound  field  when  the  maskers  were  speech-shaped  noise  that  produced  primarily  energetic 
masking.  Because  that  value  reflects  the  maximum  attenuation  of  the  spatial  filter,  the  effect  of 
tuning  -  whether  sharp  or  broad  -  is  minimal.  However,  when  the  dominant  form  of  masking  is 
informational  masking,  much  larger  SRM  is  often  reported  ranging  from  about  12-18  dB  based 
on  several  findings  from  our  laboratory  (e.g.,  Arbogast  et  al.,  2002;  Marrone  et  al.,  2008a;  Kidd 
et  al.,  2010).  The  bandwidth  of  the  spatial  filter  also  may  be  quite  narrow,  in  the  range  of  10-15°. 
Further  studies  completed  recently  have  helped  to  clarify  this  process.  Kidd  et  al.  (2010;  see 
also Best et al., 2011) measured SRM for combinations/proportions of energetic and
informational  maskers  at  different  spatial  separations.  They  also  examined  performance  under 
various  filtered  conditions  designed  to  limit  the  availability  of  interaural  time  and  level  cues  (ITDs 
and  ILDs,  respectively).  A  composite  figure  illustrating  some  of  these  findings  is  shown  in 
Figure 1.

Fig. 1

This figure contains a schematic of a hypothetical spatial filter (solid line with no data points)
computed  using  the  SRM  values  from  Marrone  et  al.  (2008a)  for  a  three-talker  mixture  with  the 
target  talker  always  directly  in  front  of  the  listener  (0°  azimuth)  and  the  maskers  either  colocated 
with  the  target  or  symmetrically  spatially  separated.  In  addition,  two  other  sets  of  data  are  also 
plotted for comparison: first, the results for maskers located only at 0° or ±90° for reversed
speech  and  speech-shaped  speech-modulated  noise  are  plotted  as  R  and  N,  respectively.  Both 
the  noise  and  reversed  speech  are  intended  as  controls  for  the  energetic  masking  produced  by 
forward  speech;  however,  they  are  thought  to  produce  less  informational  masking  with  the  noise 
producing  very  little  informational  masking  and  the  reversed  speech  an  intermediate  amount. 
These data were part of the Marrone et al. (2008a) study and were obtained using the same
procedures and subjects. The other data points are from subsequent work by Kidd et al. (2010)
using similar procedures and stimuli. The parameter varied in that study is the filtering of
the speech targets and maskers (see symbol key), which were broadband (partial replication of
Marrone et al., 2008a), low-passed at 1.5 kHz, band-passed (1.5-3 kHz) and high-passed (3
kHz).
Overall this figure is intended to illustrate several points regarding spatial tuning: first, highly
tuned responses may be observed under conditions dominated by informational masking; this is
apparent from the solid filter function based on Marrone et al.'s data and other related work from
our laboratory (see also the recent work by Wan et al., 2010, modeling the Marrone et al. data
using  a  modified  E-C  model).  Generally,  the  magnitude  of  SRM  varies  with  the  amount  of 
informational  masking  as  indicated  by  the  results  from  the  noise  and  reversed  speech  maskers. 
This  suggests  that  large  SRM  for  speech  identification  is  primarily  a  consequence  of  release 
from  informational  masking.  However,  in  order  to  achieve  this  large  SRM  and  sharply  tuned 
responses  the  observer  also  must  have  robust  binaural  information  to  use  in  focusing  attention 
in  azimuth.  This  conclusion  is  based  on  the  filtered  speech  results,  which  were  obtained  under 
high  informational  masking  conditions.  In  those  conditions,  low-pass  filtering  limited  the 
usefulness  of  ILDs  and  high-pass  filtering  limited  the  usefulness  of  ITDs.  Under  those  filtered 
conditions,  SRM  was  reduced  relative  to  the  broadband  case  and  the  pattern  of  attenuation,  to 
the extent that it could be ascertained, was less sharply tuned. Thus it appears that when
binaural information is degraded by limiting usable interaural time or level differences, both the
apparent sharpness of tuning and the magnitude of SRM are significantly reduced.

The  preceding  work  led  to  questions  regarding  the  extent  to  which  sound  sources  falling 
within  the  focus  of  attention  could  be  separated  perceptually  when  additional  spatially  separated 
sources  (presumably  outside  of  the  primary  focus  of  attention)  were  present.  The  work 
described  in  this  section  developed  in  stages.  Initially,  Brungart  et  al.  (2007)  reported  findings 
from  conditions  in  which  two  speech  maskers  were  varied  in  location  relative  to  a  target  speech 
source.  In  one  subset  of  conditions,  which  was  implemented  by  presenting  stimuli  through 
earphones  and  applying  HRTFs  to  create  spatialized  images,  one  masker  was  colocated  with 
the  target  and  a  second  masker  was  spatially  separated.  Separating  the  second  speech  masker 
did  not  improve  speech  recognition  performance  relative  to  the  case  in  which  both  maskers 
were  colocated  with  the  target.  This  finding  suggested  that  spatial  filtering  is  compromised  when 
a  complex  segregation  task  must  be  performed  at  the  point  of  focus  of  attention;  i.e.,  at  the 
target  location.  That  result,  however,  appeared  to  be  inconsistent  with  some  unpublished  work 
from  the  Psychoacoustics  Laboratory  at  BU.  In  order  to  understand  the  reasons  underlying  the 
different  findings,  Dr.  Virginia  Best  visited  the  laboratory  at  WPAFB  and  collaborated  on  a  series 
of  experiments  conducted  over  several  days  that  investigated  the  role  of  some  of  the  differences 
in  design  between  the  two  studies.  A  summary  of  a  portion  of  those  results  is  shown  in  Figure  2 
(Best et al., unpublished).
In this figure, group mean proportion correct speech identification performance is plotted for
three target and masker configurations (always two independent speech maskers): target and
maskers colocated (TMM), one masker colocated with the target and one separated (TM-M), and
both maskers separated (T-M-M) as a function of target-to-masker ratio. In general, no
advantage was found for separating only a single masker while very large advantages (greater
than 10 dB) were apparent when both maskers were separated. These results generally
supported the earlier report of Brungart et al. (2007). However, the recently published findings of
Kidd  et  al.  (2010)  suggest  a  more  complex  picture  in  which  a  variety  of  factors  influence  tuning 
in  mixed-location  conditions  and  suggest  that  under  some  conditions  significant  SRM  may  occur 
even when one masker is colocated with the target. In both the Best et al. study and the Kidd et
al. study, threshold (roughly 50% correct) with both maskers colocated with the target occurred
around 1-4 dB. In the Best et al. study, moving one masker away from the target did not improve
performance. In the Kidd et al. study, however, 50% correct performance was achieved under a
similar combined colocated-separated masking condition for a target-to-masker ratio around
-11 dB, yielding an SRM of 12 dB. One crucial variable that appears to underlie the large benefit
of moving one masker off the point of focus is the very low threshold found when there was only
a single masker talker colocated with the target. The group mean threshold found by Kidd et al.
for a one-talker masker colocated with the target was about -22 dB. Thus, adding a second
independent colocated masker talker raised target threshold by 23 dB - an enormous increase in
masking. Although the
Best  et  al.  study  replicated  the  earlier  report  by  Brungart  et  al.,  it  did  not  measure  the  single 
masker  talker  condition  when  target  and  masker  were  colocated.  This  leaves  open  the 
possibility,  but  does  not  prove,  that  the  degree  of  difficulty  in  the  segregation  task  at  the  point  of 
focus  of  attention  is  key.  In  the  Kidd  et  al.  study,  as  with  a  number  of  related  findings, 
performance  when  there  are  two  independent  speech  maskers  colocated  with  a  target  talker  is 
relatively stable (small variability across subjects and studies) at a slightly positive
target-to-masker ratio. We believe that this positive value is due to a high degree of informational
masking  and  limited  cues  for  segregating  the  target;  for  example,  reversing  the  two  speech 
maskers,  which  preserved  the  energetic  masking  but  greatly  reduced  the  informational  masking 
present,  decreased  thresholds  by  12  dB  consistent  with  the  presence  of  a  high  degree  of 
informational  masking.  However,  when  only  a  single  masker  talker  is  present,  the  variability 
across  subjects  and  studies  is  quite  large  and  appears  to  depend  heavily  on  the  specifics  of  the 
speech  materials  and  presentation  conditions.  Thus,  under  some  conditions  the  listener  may 
easily  segregate  the  target  and  (single)  masker  in  the  colocated  case  leading  to  the  very  low 
threshold  T/Ms,  such  as  those  found  by  Kidd  et  al.  In  that  case,  spatial  filtering  may  provide  a 
very  large  benefit  by  attenuating  the  separated  talker  when  a  second  masker  talker  is  added.  If 
the  segregation  task  is  difficult  even  for  one  talker  colocated  with  the  target  (which  we  speculate 
may  have  been  the  case  for  the  conditions  tested  by  Brungart  et  al.  and  Best  et  al.)  then 
attenuating  the  second  spatially  separated  talker  would  have  little  effect.  This  issue  reveals  the 
complexity  of  the  interactions  that  may  take  place  among  multiple  sources  and  how  the 
magnitude  of  the  advantage  of  spatial  separation  depends  on  the  cues  available  to  the  listener 
and  the  degree  of  informational  masking  present. 
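
The threshold arithmetic in the preceding paragraph can be checked with a short sketch. The function name spatial_release_db is hypothetical, and the +1 dB threshold for two colocated maskers is not stated directly in the text; it is inferred here from the reported -11 dB threshold and 12 dB SRM.

```python
def spatial_release_db(tm_colocated_db, tm_separated_db):
    """SRM: threshold T/M in the colocated case minus threshold T/M in the separated case."""
    return tm_colocated_db - tm_separated_db

# Approximate group-mean thresholds quoted above for Kidd et al. (2010), in dB T/M.
two_maskers_colocated = 1.0          # inferred, not stated directly (assumption)
one_colocated_one_separated = -11.0  # stated in the text
one_masker_colocated = -22.0         # stated in the text

srm = spatial_release_db(two_maskers_colocated, one_colocated_one_separated)
added_masking = two_maskers_colocated - one_masker_colocated

print(f"SRM for moving one masker away: {srm:.0f} dB")
print(f"Masking added by a second colocated talker: {added_masking:.0f} dB")
```

Both quoted differences (12 dB and 23 dB) follow from the same inferred colocated threshold, which is why that value is used here.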

II.B. Stream Formation, Segregation and Maintenance over Time

This aim is broad, encompassing many aspects of current research in hearing.
Understanding what comprises an auditory stream and how the listener maintains the linkage
between the elements of the stream under the pressure of competing maskers is fundamental to
understanding multisource listening. The work in this area examined which factors
bind sounds together to form auditory streams, focusing primarily on speech, with some of the
more recent experiments extended to include nonspeech patterns.

The  first  work  under  this  aim  used  a  new  adaptation  of  the  procedure  originally  developed 
by Broadbent (1952). In this procedure, two sentences are presented to the listener in
alternating word format: the words from talker A are the odd-numbered words in the
sequence while the words from talker B are the even-numbered words in the sequence.
Because  the  words  from  the  two  talkers  do  not  overlap  in  time  there  is  no  simultaneous  (or 
energetic)  masking.  Our  control  conditions  also  indicate  that  for  the  parameters  used  in  the 
experiments  any  forward  masking  is  inconsequential.  Thus,  the  masking  that  occurs  -  which 
may  be  quite  significant  -  is  all  informational  masking.  This  paradigm  is  very  useful  for 
examining speech features or stimulus variables that link sounds together perceptually or
semantically. The initial work using this procedure and a specifically designed speech corpus
was  reported  by  Kidd  et  al.  (2008b).  One  subset  of  those  findings  will  be  reviewed  here  in  detail. 
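
The alternating-word format described above can be sketched as a simple interleaving of two word lists. The example sentences and the function name interleave_words are hypothetical and are not taken from the Kidd et al. (2008b) corpus.

```python
from itertools import chain

def interleave_words(talker_a, talker_b):
    """Alternate words so talker A fills odd positions and talker B even positions (1-indexed)."""
    if len(talker_a) != len(talker_b):
        raise ValueError("both sentences must contain the same number of words")
    return list(chain.from_iterable(zip(talker_a, talker_b)))

# Hypothetical five-word sentences, used only for illustration.
target = ["Bob", "found", "three", "red", "shoes"]
masker = ["Sue", "lost", "nine", "old", "toys"]

sequence = interleave_words(target, masker)
print(sequence)

# Words never overlap in time, so each talker can be recovered by position alone.
recovered_target = sequence[0::2]
recovered_masker = sequence[1::2]
```

Because the two word streams occupy disjoint time slots, any loss in target identification in such a sequence cannot be attributed to simultaneous masking.
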
is relative to the case in which both sequences have random word order. First, large benefits to
manipulations tested in the study which varied masker predictability; only manipulations linking
the  target  words  together  were  beneficial.  This  latter  point  is  considered  further  below  in  the 
context  of  the  Listener  Max-Min  observer  strategies.  The  finding  regarding  the  benefit  of  correct 
syntax  in  overcoming  informational  masking  is  important  because  it  reveals  that  observer 
expectation  may  play  a  critical  role  in  stream  maintenance  and  the  reception  of  information 
conveyed  by  the  stream.  Target  words  presented  in  random  order  test  serial  recall.  Imposing 
syntactic  order  to  word  sequences  increases  the  predictability  of  the  words  and  conforms  to  the 
rules  of  normal  language.  While  it  is  expected  that  recall  would  be  better  for  syntactically 
presented  words  than  randomly  ordered  words,  the  significant  finding  here  is  the  difference  in 
the vulnerability of the materials to informational masking. This finding suggests that listener
expectation may be very important in multisource listening under high informational masking
conditions.


A second project examining streaming was conducted jointly with the WPAFB group. Iyer et
al. (2009) used the speech corpus developed by Kidd et al. (2008b) in a novel experimental
paradigm in which listeners identified target words played in repeating loops. As in the
adaptation of the Broadbent (1952) procedure described above, masker stimuli were temporally
interleaved with the target words. The maskers varied in informational masking content
and included noise, reversed speech and forward (intelligible) speech. In addition to the
informational masking value of the maskers, stimulus parameters promoting
grouping/segregation were manipulated. These included fundamental frequency differences
between speech targets and maskers (in semitones) and inter-onset interval for the 100-ms
duration test and masker items. These parameters assessed performance under conditions
similar to those used in the common A-B-A tone sequence streaming experiments (cf. van
Noorden, 1975; review in Bregman, 1990). Figure 4 shows some of the findings from this
study. The ordinate is group mean percent correct identification of the target words while the
abscissa is time between item onsets. The shaded region at the top of the graph shows
identification performance when the masker was reversed speech having the same fundamental
frequency as the target (0 semitones F0 difference), which forms a reference for comparison
with the high-informational masking forward-masker speech results. Noise maskers had little
effect on intelligibility and those results are not shown. The functions indicate performance for
differences in fundamental frequency between target words and masker words. This result is in
agreement with the Kidd et al. (2008b) findings regarding the benefit of a constant target voice
in reducing the informational masking produced by competing voices. As seen in Figure 4, the
greater the separation in F0, the better the performance at any temporal separation, with
improving performance found as the time between elements was lengthened. The largest
difference between forward- and reversed-masker speech is indicated by the difference between
the shaded region at the top and the lowest function, which were both measured at 0 semitones
F0 separation. These findings demonstrate that stream segregation for sentences behaves in a
similar manner to the A-B-A tone-sequence streaming results (e.g., Bregman, 1990) in that
performance improves with increasing frequency/F0 separation; however, they differ in that
listeners were better able to process the target speech stream when the rate was slower, possibly
due to linguistic processing factors.

Fig. 4
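
The two stimulus parameters manipulated in this study, F0 separation in semitones and inter-onset interval, can be made concrete with a small sketch. Only the 100-ms item duration comes from the text; the target F0, the semitone steps, the 150-ms inter-onset interval and the function names are illustrative assumptions.

```python
def shift_f0_hz(f0_hz, semitones):
    """Return the F0 lying `semitones` above f0_hz (below, if negative)."""
    return f0_hz * 2.0 ** (semitones / 12.0)

def interleaved_onsets_ms(n_pairs, inter_onset_ms):
    """Onset time and role of each item when targets and maskers alternate."""
    schedule = []
    for i in range(2 * n_pairs):
        role = "target" if i % 2 == 0 else "masker"
        schedule.append((i * inter_onset_ms, role))
    return schedule

ITEM_DURATION_MS = 100    # item duration stated in the text
target_f0_hz = 100.0      # hypothetical target F0
for semitones in (0, 3, 6):   # hypothetical F0 separations
    print(semitones, round(shift_f0_hz(target_f0_hz, semitones), 1))

inter_onset_ms = 150      # hypothetical inter-onset interval
schedule = interleaved_onsets_ms(2, inter_onset_ms)
silent_gap_ms = inter_onset_ms - ITEM_DURATION_MS
print(schedule, silent_gap_ms)
```

Lengthening the inter-onset interval with a fixed item duration increases only the silent gap between items, which is the sense in which the rate becomes slower.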

II.C. Develop and Test a Quantitative Model of IM: Ideal Processing, Acceptance vs.
Rejection Processing and Stimulus Uncertainty

The  work  in  this  section  continued  a  line  of  research  into  the  role  of  two  hypothetical 
observer  strategies  that  describe  how  attention  may  act  to  improve  source  selection.  The  theory 
behind  this  work  is  an  extension  of  the  conceptual  framework  originally  developed  by  Durlach 
(1963;  1972)  and  embodied  in  the  equalization-cancelation  model  (EC).  The  basic  idea,  as 
applied  to  the  role  of  attention  in  source  selection,  is  relatively  simple:  the  observer  controls 
filters  distributed  along  a  particular  stimulus  dimension(s);  for  example,  frequency  or  azimuth.  In 
one  mode  of  operation,  the  observer  selects  the  filter  containing  the  signal  and  enhances  the 
output  of  that  filter  relative  to  other  filters.  This  represents  an  acceptance-filter  approach  also 
called  Listener  Max  (LMax)  because  it  maximizes  performance  within  the  desired  filter(s).  In 
contrast,  the  observer  can  apply  rejection  filtering  or  "nulls"  at  locations  where  undesired 
sources are present, reducing their outputs. This approach is termed Listener Min (LMin)
because the processing minimizes the effect of unwanted sources. Both approaches may be
useful in segregating/selecting a target source among competing maskers. Although a simple
filter selection process is the example given here, acceptance or rejection filtering may be much
more complex, and the conceptual framework has proved to be useful in a wide variety of
applications,  many  of  which  involve  spatial  processing  of  sound  sources  (e.g.,  Akeroyd,  2004; 
Gallun  et  al.,  2005). 
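
The distinction between acceptance (LMax) and rejection (LMin) filtering can be illustrated with a toy filter bank. This is a minimal sketch under strong assumptions, not the model used in the work cited above: each source occupies exactly one channel, attention acts as a fixed gain in dB, and the channel layout, levels, 10-dB depth and function names are all hypothetical.

```python
import math

def attention_weights_db(n_channels, target_channel, masker_channels, strategy, depth_db=10.0):
    """Per-channel gains in dB for the two hypothetical observer strategies."""
    weights = [0.0] * n_channels
    if strategy == "LMax":        # acceptance filter: enhance the target channel
        weights[target_channel] += depth_db
    elif strategy == "LMin":      # rejection filter: place nulls on the masker channels
        for channel in masker_channels:
            weights[channel] -= depth_db
    elif strategy != "none":
        raise ValueError("strategy must be 'none', 'LMax' or 'LMin'")
    return weights

def output_tm_db(target_channel, target_db, maskers_db, weights):
    """Target-to-masker ratio after weighting; maskers_db maps channel index to level in dB."""
    target_out = target_db + weights[target_channel]
    masker_power = sum(10.0 ** ((level + weights[ch]) / 10.0) for ch, level in maskers_db.items())
    return target_out - 10.0 * math.log10(masker_power)

maskers = {0: 60.0, 4: 60.0}   # two maskers in channels 0 and 4 (hypothetical)
results = {}
for strategy in ("none", "LMax", "LMin"):
    weights = attention_weights_db(5, 2, list(maskers), strategy)
    results[strategy] = output_tm_db(2, 60.0, maskers, weights)
    print(strategy, round(results[strategy], 2))
```

In this idealized case, where every masker location is known and nulled, the two strategies yield the same improvement in output target-to-masker ratio; they differ in which source, target or masker, the observer must know about in advance.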

AFOSR FA9950-08-0424
Kidd, Gerald Jr., Ph.D., PI

During the past award period, several experimental studies examining LMax and LMin
observer models have been conducted. As a general summary statement, we have found
support for both LMax and LMin mechanisms in different conditions. A portion of the evidence
supporting  each  will  be  reviewed  briefly  below.  A  general  observation,  though,  about  the 
approach  to  studying  these  hypothetical  observer  models:  it  is  difficult  to  devise  any  experiment 
that  conclusively  rules  out  one  particular  strategy.  Instead,  the  approach  that  is  usually  taken  is 
one  in  which  the  control  condition  randomizes  both  target  and  masker  values  along  the  relevant 
dimension.  The  comparison  condition  then  holds  constant  the  value  of  one  of  the  types  of 
stimuli  -  target  or  masker  -  so  that  the  observer  knows  beforehand  which  stimulus  should 
receive  particular  emphasis/preprocessing.  Thus,  if  advantages  are  found  for  fixed-target 
conditions,  we  infer  that  the  observer  adopts  or  exploits  an  LMax  strategy,  maximizing  the  target 
properties. Conversely, if the masker value is fixed, any performance advantages may be
attributed  to  an  LMin  process.  Although  somewhat  indirect,  this  approach  yields  sensible  patterns 
of  results.  The  fact  that  the  different  experiments  support  either  LMax  or  LMin  makes  it  difficult  to 
propose  a  single  model  to  account  for  listener  performance.  However,  the  notion  that  human 
observers  have  multiple  strategies  available  for  use  in  solving  complex  listening  tasks  -  and 
apply  them  according  to  circumstance  -  is  neither  surprising  nor  new.  The  challenge  is  to  find 
consistencies  in  the  patterns  of  results  implicating  broad  categories  of  task  demands  for  which 
one  strategy  or  the  other  is  optimal. 

The  first  example  of  support  for  an  LMax  model  may  be  found  by  inspection  of  Figure  3 
above.  Using  the  every-other-word  speech  identification  task,  Kidd  et  al.  (2008b)  compared 
performance  in  a  control  condition  in  which  the  relevant  parameters  of  both  target  and  masker 
words  were  randomized  across  trials  with  comparison  conditions  in  which  either  or  both  target 
and  masker  values  were  fixed  across  trials.  In  the  Kidd  et  al.  study,  correct  syntactic  structure 
was  one  such  "linkage  variable"  (as  per  Figure  3)  as  were  constant  talker  voice  and  apparent 
spatial  location.  In  all  cases,  Kidd  et  al.  found  significant  benefits  when  the  target  values  were 
held  constant  but  no  corresponding  benefits  when  masker  values  were  held  constant.  Thus, 
they  concluded  that  their  findings  were  consistent  with  an  LMax  observer  strategy  with  no  support 
found  for  an  LMin  strategy. 

A  second  recent  study  using  nonspeech  stimuli  also  found  support  for  an  LMax  observer 
model.  In  this  case,  Kidd  et  al.  (2008c)  measured  detection  thresholds  under  certain  and 
uncertain  target  frequency  conditions.  In  the  certain  condition,  the  target  tone  frequency  was 
held  constant  ("fixed")  across  each  block  of  trials  while  in  the  uncertain  condition  the  target 
frequency  was  chosen  randomly  on  each  trial  ("random").  Thresholds  for  two  target  frequencies 
were  measured;  one  at  a  relatively  low  frequency  and  one  at  a  relatively  high  frequency.  Three 
conditions were tested for each: an unmasked control, a notch-filtered Gaussian noise
masker, and a multitone masker whose frequencies were randomly selected
presentation-by-presentation (cf. Neff and Green, 1987; review in Kidd et al., 2008a). The
notched-noise and multitone maskers are illustrated schematically in Figure 5 along with the two
target frequencies (red lines inside of "protected regions" where masker energy was excluded).

Fig. 5
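
A random multitone masker with protected regions can be sketched as rejection sampling over frequency. The frequency range, number of tones, protected-region edges, log-uniform sampling and the function name below are assumptions chosen for illustration; the report does not give these values.

```python
import math
import random

def draw_multitone_masker(n_tones, f_lo_hz, f_hi_hz, protected_regions, rng):
    """Draw masker tone frequencies at random, rejecting any inside a protected region."""
    log_lo, log_hi = math.log(f_lo_hz), math.log(f_hi_hz)
    freqs = []
    while len(freqs) < n_tones:
        # Log-uniform sampling is an assumption, not a detail from the report.
        f = math.exp(rng.uniform(log_lo, log_hi))
        if not any(lo <= f <= hi for lo, hi in protected_regions):
            freqs.append(f)
    return sorted(freqs)

# Hypothetical protected regions around a low and a high target frequency.
protected = [(400.0, 650.0), (2400.0, 3700.0)]
rng = random.Random(0)   # seeded only to make the sketch repeatable

# A fresh draw on every presentation gives the masker its uncertainty.
masker = draw_multitone_masker(8, 200.0, 5000.0, protected, rng)
print([round(f) for f in masker])
```

Because no masker component falls inside the protected regions, masking produced by such a stimulus is attributed mainly to uncertainty rather than to energy at the target frequency.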

In  the  random-frequency  condition,  the  listener  had  to  monitor  two  frequency  regions  and 
make  detection  judgments  about  each  while  in  the  fixed-target  frequency  case  only  one 
frequency  region  had  to  be  monitored  by  the  listener.  The  assumption  was  that  any  benefit  to 
detection  performance  due  to  holding  target  frequency  constant  could  be  related  to  an  LMax 
strategy  in  which  the  observer  emphasized  the  processing  at  the  known  (and  therefore 
attended) frequency region. Figure 6 illustrates the results. This figure shows thresholds for
fixed-frequency and random-frequency targets (upper panel) in quiet, notched-noise and
multitone maskers as well as the "costs" (difference between fixed- and random-frequency
thresholds; lower panel) for each. The important finding from this study was the much larger
costs associated with target frequency uncertainty for the highly informational multitone
masker. This finding is still not fully understood and work continues to explain and model the
results. Conversely, one can think of the costs of uncertainty as indicating the benefit afforded
by an LMax observer strategy.

Fig. 6
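
The "cost" measure can be written out explicitly. The sketch below assumes the cost is the random-frequency threshold minus the fixed-frequency threshold, so that a positive value means uncertainty raised the threshold. The threshold numbers are made-up placeholders that only show the shape of the computation; they are not data from Kidd et al. (2008c).

```python
def uncertainty_cost_db(fixed_threshold_db, random_threshold_db):
    """Cost of target-frequency uncertainty: random-frequency minus fixed-frequency threshold."""
    return random_threshold_db - fixed_threshold_db

# Placeholder (fixed, random) thresholds in dB; NOT measured values.
placeholder_thresholds = {
    "quiet": (10.0, 12.0),
    "notched-noise": (40.0, 43.0),
    "multitone": (45.0, 60.0),
}

costs = {
    condition: uncertainty_cost_db(fixed, rand)
    for condition, (fixed, rand) in placeholder_thresholds.items()
}
largest = max(costs, key=costs.get)
print(costs, largest)
```

Read the other way, each cost is the benefit obtained when the target frequency is known in advance.
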

Recently, we have begun devising a model to account for these effects (Thompson and
Kidd, 2011). To date, the initial efforts capture some of the important effects for the differences
in masking due to energetic (notched-noise) vs. informational (randomized multitone)
maskers, but do not yet successfully predict the costs for the different masker types. However,
for the fixed-frequency conditions, the thresholds and slopes of the
underlying  psychometric  functions  are  predicted  to  a  rough  first-order  approximation.  This  is 
illustrated  in  Figure  7.  Shown  here  are  the  model  predictions  plotted  as  psychometric  functions 
(solid  and  dashed  lines)  along  with  the  corresponding  group  mean  data  points  (squares)  from 
Thompson  and  Kidd  (2011).  This  model  uses  the  physiologically-inspired  preprocessing  front 
end  described  by  Dau  et  al.  (1996)  that  consists  of  an  outer/middle  ear  transfer  function, 
gammatone  cochlear  filter  bank,  half-wave  rectification  and  low-pass  filtering,  and  adaptation 
loops.  The  decision  process  is  based  on  an  ideal  detector  in  which  the  signal  is  known  exactly 
and  the  masker  statistics  are  known  and  stationary  (cf.  Green  and  Swets,  1974).  The  model 
captures the difference in slope with masker type (data for slopes not shown; cf. Kidd et al.,
1998,  2002).  We  are  currently  testing  an  alternative  approach  based  on  Lutfi's  CoRE  model 
(Lutfi,  1993;  Alexander  and  Lutfi,  2004)  exploring  modifications  that  can  capture  the  added  costs 
associated  with  the  interaction  between  target  and  masker  frequency  uncertainty. 
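
For orientation, the textbook signal-known-exactly case gives a closed-form psychometric function. The sketch below is not the Thompson and Kidd (2011) model: it omits the auditory front end entirely and assumes white Gaussian noise and a two-interval forced-choice task, neither of which is stated in the report.

```python
import math

def d_prime_ske(e_over_n0_db):
    """Ideal-observer d' for a signal known exactly in white Gaussian noise: sqrt(2E/N0)."""
    e_over_n0 = 10.0 ** (e_over_n0_db / 10.0)
    return math.sqrt(2.0 * e_over_n0)

def proportion_correct_2afc(d_prime):
    """Proportion correct in two-interval forced choice: Phi(d'/sqrt(2)) = 0.5*(1 + erf(d'/2))."""
    return 0.5 * (1.0 + math.erf(d_prime / 2.0))

# Psychometric function: proportion correct versus signal-to-noise ratio (E/N0 in dB).
levels_db = [-10, -5, 0, 5, 10]
psychometric = [proportion_correct_2afc(d_prime_ske(level)) for level in levels_db]
for level, pc in zip(levels_db, psychometric):
    print(level, round(pc, 3))
```

The slope of this function is fixed by the noise statistics, so a model of this kind needs additional assumptions to reproduce the different slopes observed for different masker types.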

Fig. 7

The preceding findings may be construed as providing evidence for the benefit of an LMax
observer model, because a priori knowledge about the target in a highly uncertain
informational masking listening situation improved performance. However, as noted above, we
have also found equally convincing evidence in support of an LMin observer model. Kidd et al.
(2011) examined contextual effects in the identification of complex, nonspeech sounds. The
identification task, which we have used in past studies of energetic and informational masking,
requires subjects to learn six pure-tone sequences that differ in the order of frequencies within
a narrow band. A schematic illustration of this set of spectrotemporal patterns is shown in
Figure 8. Note that these representations are of relative frequency. The narrowband patterns
(e.g., total range usually about 14% of the nominal "center frequency") are easily identifiable
in quiet regardless of the absolute frequency at which they are presented. This makes the
stimulus set a good choice for studying mechanisms of masking at suprathreshold levels because
masker energy may be overlaid directly on the patterns, causing predominantly energetic
masking, or may be presented remote in frequency from the targets, creating informational
masking (e.g., Kidd et al., 1998, 2002). The patterns are also well suited for examining auditory
processing of sequences of sounds, especially when varying target and/or masker uncertainty is
of interest.

Fig. 8 (horizontal axis: Time)
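
To make the idea of a relative-frequency pattern concrete, the short Python sketch below maps a pattern defined on a normalized scale onto tone frequencies around an arbitrary center frequency. The 14% total range comes from the text; the eight-tone length, the linear frequency spacing, the two example patterns, and all names are assumptions for illustration and are not the actual stimulus set.

```python
def pattern_frequencies(relative_steps, center_hz, total_range=0.14):
    """Map relative positions in [-0.5, 0.5] to tone frequencies in Hz.

    total_range is the span of the pattern as a fraction of the nominal
    center frequency (about 14% in the text). Linear spacing is assumed.
    """
    return [center_hz * (1.0 + total_range * r) for r in relative_steps]


# Hypothetical eight-tone patterns on the relative scale.
RISING = [-0.5 + i / 7 for i in range(8)]
FALLING = list(reversed(RISING))

if __name__ == "__main__":
    for center in (1000.0, 2000.0):
        tones = pattern_frequencies(RISING, center)
        print(center, [round(f, 1) for f in tones])
```

Because each frequency is a fixed multiple of the center frequency, the same relative pattern is reproduced at any absolute frequency, which is the property that allows the target to be roved from trial to trial.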

Figure 9 illustrates one trial of an experiment that used the nonspeech pattern identification
task to examine sequential interactions among stimuli (Kidd et al., 2011). This figure is a
schematic in sound spectrogram form of a target (bold, one of the six patterns shown in Figure
8) masked by a multitone masker randomized in frequency content from trial to trial. Except
for one condition discussed below, the target frequency was also chosen randomly on every
trial. The task is to identify the target pattern in a one-interval, six-alternative forced-choice
(1I-6AFC) paradigm. In this illustration, the target-plus-masker (right portion of panel) is
preceded by a cue/precursor. In the schematic, the precursor shown is an exact copy of the
subsequent masker.

Fig. 9 (segments, left to right: Masker Cue, Silence, Masker+Target)
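
The trial structure just described can be sketched in Python as follows. This is only an illustration of the logic (random target frequency, random multitone masker kept away from the target region, and a precursor that copies the masker). Every numeric value, the log-uniform sampling, the protected band, and all names are hypothetical and are not taken from Kidd et al. (2011).

```python
import math
import random

# All numeric values below are hypothetical placeholders, not published parameters.
N_PATTERNS = 6                       # six target patterns, as in the text
N_MASKER_TONES = 8                   # assumed number of masker components
MASKER_RANGE_HZ = (200.0, 5000.0)    # assumed masker frequency range
TARGET_RANGE_HZ = (500.0, 3000.0)    # assumed range for random target placement
PROTECTED_OCTAVES = 1.0              # assumed masker-free band centered on the target


def log_uniform(rng, lo_hz, hi_hz):
    """Draw a frequency uniformly on a logarithmic scale."""
    return math.exp(rng.uniform(math.log(lo_hz), math.log(hi_hz)))


def draw_masker(rng, target_hz):
    """Draw masker tones at random, rejecting any inside the protected band."""
    lower = target_hz * 2.0 ** (-PROTECTED_OCTAVES / 2.0)
    upper = target_hz * 2.0 ** (PROTECTED_OCTAVES / 2.0)
    tones = []
    while len(tones) < N_MASKER_TONES:
        f = log_uniform(rng, *MASKER_RANGE_HZ)
        if f < lower or f > upper:
            tones.append(f)
    return sorted(tones)


def build_trial(rng, fixed_target_hz=None, with_masker_cue=True):
    """Assemble one trial: optional masker cue, then target plus masker."""
    if fixed_target_hz is None:
        target_hz = log_uniform(rng, *TARGET_RANGE_HZ)  # target roved across trials
    else:
        target_hz = fixed_target_hz                     # target held constant
    masker = draw_masker(rng, target_hz)
    return {
        "target_hz": target_hz,
        "pattern_index": rng.randrange(N_PATTERNS),
        "cue": list(masker) if with_masker_cue else None,  # exact copy of the masker
        "masker": masker,
    }


if __name__ == "__main__":
    print(build_trial(random.Random(0)))
```

Passing a value for fixed_target_hz with the cue disabled mimics a condition in which only the target frequency is held constant, while the default call mimics a cued trial with both target and masker randomized.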

The contextual effects that Kidd et al. were interested in studying tend to emphasize or
"enhance" spectral contrasts (cf. the "enhancement effect": Viemeister, 1980; Viemeister and
Bacon, 1982; Summerfield et al., 1987;
Byrne  et  al.,  2011).  Various  precursors  were  tested  including  the  exact  copy  of  the  masker 
shown  in  Figure  9  ("masker  cue")  as  well  as  that  precursor  presented  to  the  opposite  ear  from 
the target-plus-masker ("contralateral masker cue") and a notch-filtered noise having a notch
centered  on  the  target  presented  ipsilateral  to  the  target-plus-masker  ("noise  cue").  One 
condition  without  a  precursor  tested  the  effectiveness  of  an  LMax  strategy  by  holding  the  target 
frequency  constant  across  trials  ("fixed  target").  The  improvement  in  performance  from  each  of 
these  contextual  cues,  relative  to  an  uncued  control  condition  in  which  both  the  target  and 
masker  were  randomized  across  trials,  is  plotted  in  Figure  10  in  rationalized  arcsine  units 
(RAUs). First, no significant benefit was found for holding the target frequency constant. In
this highly uncertain experiment, this result does not support the use of an LMax listener
strategy. In contrast, the three masker precursors did provide a significant benefit, with the
greatest advantage found for the exact masker precursor. That finding supports an effective LMin
strategy. However, a portion of the effect, that revealed by the small but significant
notched-noise benefit, may reflect a component of auditory enhancement that depends solely on
differential  prior  stimulation  of  masker  and  target  channels.  This  enhancement  effect  likely  is 
due  to  bottom-up  inhibitory  processes  that  are  not  directed  by  attentional  control  or  a  priori 
knowledge.  In  contrast,  the  somewhat  larger  contralateral  exact-masker  cue  benefit  may  be 
complementary  and  directed  by  top-down  mechanisms.  This  latter  finding  may  be  related  to  the 
contralateral  contrast  enhancement  effects  in  sequences  of  speech  sounds  reported  by  Holt  and 
colleagues  (Holt,  2005,  2006a, b;  Lotto  et  al.,  2003). 
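
Because the benefits above are expressed in rationalized arcsine units, a short Python sketch of the transform may be helpful. It implements the commonly used form attributed to Studebaker (1985). Whether Kidd et al. (2011) used exactly this variant is an assumption, and the trial counts in the usage example are hypothetical rather than data from the study.

```python
import math


def rau(correct, trials):
    """Rationalized arcsine transform of a score of `correct` out of `trials`.

    theta = asin(sqrt(X / (N + 1))) + asin(sqrt((X + 1) / (N + 1)))
    RAU   = (146 / pi) * theta - 23
    """
    theta = (math.asin(math.sqrt(correct / (trials + 1)))
             + math.asin(math.sqrt((correct + 1) / (trials + 1))))
    return (146.0 / math.pi) * theta - 23.0


def cue_benefit(correct_cued, correct_uncued, trials):
    """Benefit of a cue in RAUs relative to an uncued control condition."""
    return rau(correct_cued, trials) - rau(correct_uncued, trials)


if __name__ == "__main__":
    # Hypothetical counts: 70 vs. 55 correct out of 100 trials.
    print(round(cue_benefit(70, 55, 100), 2))
```

The transform is close to percent correct over the middle of the range but stretches scores near floor and ceiling, which makes differences between conditions more comparable across performance levels.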


Fig. 10